Audio visual dubbing system and method.

A system and method for replacing the original
sound track of a video or film sequence
depicting a dubbee with an audio signal indica- 
tive of substituted utterances by a dubber as- 
sociates frames of a stored or transmitted 
sequence with facial feature information as- 
sociated with utterances in a language spoken 
by the dubber. The frames of the sequence are 
modified by conforming mouth formations of 
the dubbee in accordance with the facial feat- 
ure information using a look up table relating 
detected dubber utterances to a set of speaker 
independent mouth formations or actual mouth 
formations of the dubber. In accordance with 
the present invention, a viewer of a currently
transmitted or previously stored program may 
manually select between viewing the original 
broadcast or program or viewing a dubbed 
version in which a second audio track indicative 
of utterances in a language different than that 
of the dubbee is reproduced. Once such a 
selection is made, the second audio track is 
utilized to conform the mouth movements of the 
dubbee to those of someone making the dubber's
utterance.
Background Of The Invention

Field of the Invention 

This invention relates generally to the recording 
of audio on the sound tracks of video or film media 
and, more particularly, to a method and apparatus for 
conforming the mouth movements of speaking actors 
depicted or represented in such media to be consis- 
tent with those of a speaker of a foreign language to 
be "dubbed" or substituted therefor. 

Description of the Related Art 

Various techniques have been proposed for pro- 
viding a translation of a video or film sound track into 
another language. The most common method is dub- 
bing, i.e. substituting audio in the second language 
for that of the original. For example, in U.S. Patent No. 
3,743,391 entitled SYSTEM FOR DUBBING FRESH 
SOUND TRACKS ON MOTION PICTURE FILM, 
there is disclosed a dubbing technique in which a vid- 
eo tape recording is made of the original motion pic- 
ture in synchronization therewith. The tape drives a 
television display while cue information is recorded 
on the tape or marked on the motion picture film. The 
tape is played back, and the cue information is used 
to prompt the recording of the desired sound informa- 
tion, including the dialog in the other language. The 
recording of the sound information is done in seg- 
ments and recorded along different laterally dis- 
placed longitudinal areas of the tape or film, and an 
effort is made to juxtapose the segments so that they 
correspond to related image recording. 

More recently, specialized software programs 
have been marketed which enable digitized video 
frame sequences to be stored and manipulated by 
computer. Utilizing such programs, sequences of vid- 
eo frames can be displayed on a monitor and selected 
audio signal segments can be precisely aligned with 
them. Thus, it is now possible to achieve the best pos- 
sible match between a dubber's speech and an ac- 
tor's visual cues (e.g. gestures, facial expressions, 
and the like) in a video frame sequence depicting the 
corresponding speaking actor or "dubbee". 

Regardless of the specific technique employed, 
however, there are certain limitations associated with 
foreign language dubbing which cannot be overcome 
by precise control of audio segment placement. Spe- 
cifically, the movements of the dubbee's mouth tend 
to be inconsistent with the dubber's speech. Such in- 
consistency can be extremely distracting to the view- 
er, particularly when the types of mouth formations 
and lip movements are very different in the respec- 
tive languages. 

Accordingly, it is an object of the present inven- 
tion to provide a visual-audio dubbing technique 
which makes it possible to conform the mouth movements
of the dubbee to the mouth movements associated
with the language being substituted. That is,
the dubbee's mouth movements are modified so that 
they are consistent with the utterances of the dubber. 

Summary of the Invention 

The aforementioned object, as well as others
which will hereinafter become apparent to those skilled
in the art, is achieved by an audio-visual dubbing
system and method utilizing speech recognition and
facial modeling techniques.

A system and method for replacing the original
sound track of a video or film sequence depicting a
dubbee with an audio signal indicative of substituted
utterances by a dubber associates frames of a stored
or transmitted sequence with facial feature information
associated with utterances in a language spoken
by the dubber. The frames of the sequence are modified
by conforming mouth formations of the dubbee in
accordance with the facial feature information using
a lookup table relating detected dubber utterances to
a set of speaker independent mouth formations or actual
mouth formations of the dubber. In accordance
with one aspect of the present invention, a viewer of
a currently transmitted or previously stored program
may manually select between viewing the original
broadcast or program or viewing a dubbed version in
which a second audio track indicative of utterances in
a language different than that of the dubbee is reproduced.
Once such a selection is made, the second audio
track is utilized to conform the mouth movements
of the dubbee to those of someone making the dubber's
utterance.

An apparatus for performing audio-visual dubbing
in accordance with the present invention includes
monitoring means for detecting audio signal
portions indicative of dubber utterances. Each signal
portion corresponds to a mouth formation or viseme
associated with a language spoken by the dubber. By
performing speech recognition on each signal portion,
it is possible to determine whether the utterance
to be associated with a frame corresponds to a phoneme,
homophene, or other sound which requires the
speaker to utilize a particular, visually recognizable
mouth formation. Mouth formation parameters, which
are utilized to modify respective frames of the original
sequence, may be extracted from images of the dubber
or from images of a plurality of different persons
uttering the phonemes or other speech segments
which coincide with the dubber's speech. In either
event, these parameters are de-normalized to the
scale of corresponding features in the original frame
and texture mapping is performed to obtain a modified
frame in which the dubbee appears to be making
the utterance attributable to the dubber.

It will, of course, be understood by those skilled
in the art that other facial information may be previously
extracted and stored for use in conforming the
dubbee's appearance to simulate utterance of the
speech substituted by the dubber. As such, the present
invention may utilize associating means operative
to associate positions of the jaw, tongue, and teeth
with respective portions of the audio signal.

EP 0 674 315 A1

The various features of novelty which character- 
ize the invention are pointed out with particularity in 
the claims annexed to and forming a part of this dis- 
closure. For a better understanding of the invention, 
its operating advantages, and specific objects at- 
tained by its use, reference should be had to the draw- 
ing and descriptive matter in which there are illustrat- 
ed and described preferred embodiments of the in- 
vention. 

Brief Description Of The Drawings 

The features and advantages of the present in- 
vention will be more readily understood from the fol- 
lowing detailed description when read in light of the 
accompanying drawings in which: 

FIG. 1 is a flowchart depicting the various steps 
of an illustrative embodiment of a speech assist- 
ed audio-visual dubbing technique according to 
the present invention; 

FIG. 2 is a flowchart depicting the various steps 
of an alternate embodiment of a speech assisted 
audio-visual dubbing technique according to the 
present invention; 

FIG. 3 is a block diagram showing the various ele- 
ments of an audio-visual dubbing system con- 
structed in accordance with the present inven- 
tion; and 

FIG. 4 is a block diagram showing the elements 
of a video display system utilizing the audiovisual 
dubbing technique of the present invention. 

Detailed Description Of The Preferred 
Embodiments 

In FIG. 1, block 10 indicates retrieval of a frame 
of a digitized video sequence depicting at least one 
speaking person. Techniques for digitizing video or 
film are well known and commercially available and 
are not, therefore, deemed to constitute a novel as- 
pect of the present invention. Accordingly, a detailed 
description of the same has been omitted for clarity. 
In any event, it will be readily appreciated by those 
skilled in the art that synchronized with the video se- 
quence is a corresponding original audio signal track 
representative of the speech and other sounds made 
by the actor(s) depicted. As indicated above, it is a 
principal object of the present invention to provide a 
system and method in which portions of the original 
audio signal representing the original language of the 
actors can be replaced or "dubbed" by audio signal 
portions representative of another language with a
minimum of visual distraction to the viewer. In accordance
with the technique depicted in FIG. 1, this objective
is achieved by modifying the mouth movements
and, if desired, other facial features of the actor,
to conform with the mouth movements which
would be made by a person speaking the language 
supplied by the dubber. 

With continuing reference to FIG. 1, it will be seen
in block 14 that feature extraction is performed on a
retrieved frame in accordance with a suitable image
feature extraction algorithm. An image feature extraction
technique specifically concerned with analysis
of lips, for example, is described in U.S. Patent No.
4,975,960 issued to Eric D. Petajan on Dec. 4, 1990
and entitled ELECTRONIC FACIAL TRACKING AND
DETECTION SYSTEM AND METHOD AND APPARATUS
FOR AUTOMATED SPEECH RECOGNITION.
During feature extraction, the retrieved frame
is analyzed to determine the positions and critical dimensions
corresponding to facial features, such as
the lips, eyes, and jaw, which predictably vary during
speaking. In its simplest form, the analysis is concerned
solely with movements of the actor's lips.
However, it will be readily ascertained that for more
realistic adaptation, the formations of the tongue,
teeth, eyes, and jaw should also be considered. It is
believed that suitable modeling techniques for this
purpose have already been proposed by others and
that a detailed discussion of the precise modelling
technique is not necessary here. Reference may,
however, be had to a paper presented by Shigeo Morishima
et al. at the 1989 ICASSP in Glasgow, UK entitled
"An Intelligent Facial Image Coding Driven by
Speech and Phoneme", the disclosure of which is expressly
incorporated herein by reference. In that paper,
there is described a 3-D facial modelling technique
in which the geometric surface of the actor's face
is defined as a collection of polygons (e.g. triangles).
In any event, once feature extraction has been
performed, it will then be possible, in accordance with
the present invention, to adapt the frame image of the
actor or "dubbee" to simulate utterances in the language
of the dubber. According to the embodiment
depicted in FIG. 1, the aforementioned adaptation is
achieved by analyzing the audio signal portion indicative
of the dubber's speech during the frame, as indicated
in block 16. The manner in which the dubber's
speech is synchronized with the original video frame
sequence should not seriously affect the results obtained
by the inventive technique disclosed herein.
Thus, the dubbing track may be recorded in its entirety
in advance and aligned with the video sequence using
a commercially available software program such
as Adobe Premiere, by Adobe Systems Incorporated,
or it may be recorded during the frame adaptation
process, sequence by sequence. In either case, the
speech signal analysis, which may be performed by
a conventional speech recognition circuit (not
shown), need not be full context-level recognition.
This is true because the purpose of the analysis is to
break down the dubber's utterance(s) into a sequence
of phonemes. Essentially, these phonemes
can be mapped into distinct, visible mouth shapes
known as visemes. In a simplified version of this embodiment,
the audio signal is analyzed to identify
homophenes contained in the dubber's utterances.
Essentially, a homophene is a set of phonemes that
are produced in a similar manner by the speaker such
that the positions of the lips, teeth, and tongue are visually
similar to an observer. Of course, if a higher degree
of performance is required, context level speech
recognition may be performed and the phoneme information
can be extracted therefrom.
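
The reduction of a recognized phoneme sequence to visually distinct mouth shapes may be pictured with a short sketch. The grouping below is purely illustrative: the phoneme symbols, the group names, and the `phonemes_to_visemes` helper are assumptions made for this example and are not taken from the disclosure, which leaves the actual phoneme-to-viseme assignment open.

```python
# Hypothetical homophene groups: phonemes in one group look alike on the lips.
# The symbols and group names are illustrative assumptions only.
HOMOPHENE_GROUPS = {
    "bilabial": ["p", "b", "m"],
    "labiodental": ["f", "v"],
    "rounded": ["uw", "ow", "w"],
    "open": ["aa", "ae", "ah"],
}

# Invert the groups into a phoneme -> viseme class map.
PHONEME_TO_VISEME = {
    phoneme: viseme
    for viseme, phonemes in HOMOPHENE_GROUPS.items()
    for phoneme in phonemes
}


def phonemes_to_visemes(phonemes, default="neutral"):
    """Map each recognized phoneme to its viseme class.

    Phonemes missing from the table fall back to `default`.
    """
    return [PHONEME_TO_VISEME.get(p, default) for p in phonemes]
```

Because several phonemes share one class, a recognizer that only distinguishes homophenes would be sufficient to drive such a map, which is consistent with the simplified version of the embodiment.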

In accordance with the embodiment of FIG. 1, a
modified frame is generated by modifying the parametric
facial model obtained by feature extraction via
the phoneme data obtained by speech recognition. As
indicated in block 18, this may advantageously be
achieved by addressing a look-up table containing
parametric position data corresponding to each viseme.
Since preserving picture quality is of substantial
importance, the look-up table should contain detailed
information relating to particular facial features, such
as lip, teeth, and eye positions, for each viseme.

The mouth positions that people use to pronounce
each phoneme are generally speaker-dependent.
Accordingly, the look-up table utilized in
block 18 may contain speaker-independent facial feature
information. In this event, dubber-speech adaptation
of video frame sequences in accordance with
the present invention requires de-normalization or
scaling of the stored feature information to that obtained
from the original frame by image feature extraction,
as shown in block 20. De-normalization
merely requires determining the position of selected
feature points of each relevant facial feature of the
speaker and scaling the corresponding look-up table
position parameter data accordingly. The location of
such feature points about the mouth, for example, is
described in the Morishima et al. reference discussed
above.
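
By way of illustration only, the de-normalization of block 20 might be sketched as follows. The sketch assumes, hypothetically, that the look-up table stores mouth width and opening as fractions of a unit mouth width and that feature extraction supplies the two mouth corner points of the dubbee; the table values, the field names, and the `denormalize` function are invented for the example.

```python
import math

# Hypothetical speaker-independent table: parameters are normalized so that
# a mouth width of 1.0 corresponds to the measured corner-to-corner distance.
# The values are illustrative assumptions only.
VISEME_TABLE = {
    "bilabial": {"width": 1.00, "opening": 0.00},
    "open": {"width": 0.90, "opening": 0.60},
    "rounded": {"width": 0.60, "opening": 0.30},
}


def denormalize(viseme, left_corner, right_corner):
    """Scale normalized viseme parameters to the dubbee's mouth size.

    `left_corner` and `right_corner` are (x, y) pixel positions of the
    mouth corners obtained by feature extraction on the original frame.
    Their distance is assumed here to serve as the scale reference.
    """
    scale = math.dist(left_corner, right_corner)  # reference width in pixels
    params = VISEME_TABLE[viseme]
    return {name: value * scale for name, value in params.items()}
```

A single scale factor is the simplest choice; a fuller treatment could scale each facial feature from its own feature points, as the text suggests.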

As shown in block 22, once a first phoneme is
identified from the audio signal indicative of the dubber's
speech and the stored speaker-independent facial
features corresponding thereto are de-normalized,
incremental texture mapping of facial reflectance
data acquired from the original frame is performed
to alter the mouth formation of the actor so that
he or she appears to be uttering the phoneme or homophene.
Texture mapping techniques are
well known in the art and may, for example, include
interpolating texture coordinates using an affine
transformation. For an in-depth discussion of one
such technique, reference may be had to a paper by
H. Choi et al. entitled "Analysis and Synthesis of Facial
Expressions in Knowledge-Based Coding of Facial
Image Sequences", International Conference on
Acoustics Speech and Signal Processing, pp. 2737-40
(1991).
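
The interpolation of texture coordinates by an affine transformation may be sketched for a single triangle of a polygonal face model. This is a generic barycentric formulation offered as an assumption about how such a step could look; it is not the specific method of the Choi et al. paper, and the function names are hypothetical.

```python
def barycentric(p, a, b, c):
    """Return the barycentric weights of point p in triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        raise ValueError("degenerate triangle")
    w_a = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w_b = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w_a, w_b, 1.0 - w_a - w_b


def texture_coordinate(p, dst_tri, src_tri):
    """Map pixel p of the modified triangle back into the original frame.

    Applying the weights of p in dst_tri to the vertices of src_tri is
    equivalent to the affine transform that carries dst_tri onto src_tri.
    """
    weights = barycentric(p, *dst_tri)
    x = sum(w * v[0] for w, v in zip(weights, src_tri))
    y = sum(w * v[1] for w, v in zip(weights, src_tri))
    return x, y
```

In use, each pixel of a triangle in the modified mouth region would be assigned the reflectance found at the returned coordinate in the original frame.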

As indicated in blocks 24 and 26, a modified 
frame is thus generated from the original frame and 
stored. The foregoing steps are repeated for each 
frame in the sequence until the end of the sequence 
is reached, as indicated in steps 28 and 30. It will, of 
course, be understood by those skilled in the art that 
various modifications to the embodiment depicted in 
FIG. 1 are possible. For example, although visemes 
may be modelled as speaker-independent for the pur- 
poses of the present invention, it is possible to en- 
hance the performance of the frame adaptation proc- 
ess. Thus, in a modified embodiment, instead of util- 
izing the default look-up table containing speaker-in- 
dependent viseme data as described above, a speak- 
er-dependent look-up table may be derived through 
analysis of the original audio signal portions that are 
indicative of phonemes and that correspond to trans- 
mitted or stored frames. Each time a phoneme (or 
other speech parameter indicative of a mouth forma- 
tion) common to the language of the dubber and dub- 
bee is detected, feature extraction is performed on 
the corresponding frame image(s) and feature posi- 
tion parameters are stored. In this manner, a speaker 
dependent table may be constructed for each actor. 
Of course, it may still be necessary to utilize the
speaker-independent look-up table in the event phonemes
not found in the language of the dubbee are present
in the dubber's speech.
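
A minimal sketch of such a speaker-dependent table with a fallback is given below. The class name, the use of a running mean to combine repeated observations, and the dictionary layout are assumptions made for illustration; the disclosure states only that feature position parameters are stored as phonemes are detected.

```python
class VisemeTables:
    """Speaker-dependent table with a speaker-independent fallback."""

    def __init__(self, default_table):
        self.default_table = default_table  # speaker-independent data
        self.speaker_table = {}             # learned from the dubbee

    def observe(self, phoneme, features):
        """Record features extracted from a dubbee frame for `phoneme`.

        Repeated observations are averaged with a running mean
        (an assumption of this sketch).
        """
        count, mean = self.speaker_table.get(phoneme, (0, {}))
        new_mean = {
            name: (mean.get(name, 0.0) * count + value) / (count + 1)
            for name, value in features.items()
        }
        self.speaker_table[phoneme] = (count + 1, new_mean)

    def lookup(self, phoneme):
        """Prefer dubbee-specific data; fall back to the default table."""
        if phoneme in self.speaker_table:
            return self.speaker_table[phoneme][1]
        return self.default_table[phoneme]
```

The fallback branch corresponds to phonemes that occur in the dubber's speech but were never observed in the original audio of the dubbee.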

Another embodiment of the present invention is 
depicted in FIG. 2. In accordance with this further em- 
bodiment, the mouth formations of the dubbee are 
conformed to those of the dubber. Thus, as shown in 
FIG. 2, blocks 40 and 44 are identical to blocks 10 and 
14 of FIG. 1. However, instead of performing speech 
recognition on the audio signal corresponding to the 
dubber to obtain simulated mouth position parameter 
data, it is the actual mouth formations of the dubber 
himself (or herself) which are utilized. That is, the 
mouth of the dubber is recorded on video during the 
recording of the dubbing audio portion. Thus, as 
shown in block 46, image feature extraction is per- 
formed on the mouth of the dubber. More particularly, 
once a temporal relationship is established between 
the audio speech of the dubber and the frame se- 
quence depicting the dubbee, the facial parameters 
(i.e. mouth formation data) are extracted on a frame 
by frame basis. The extracted parameter data is de- 
normalized (block 48), the original frame is texture 
mapped (block 49), and a modified frame is generated
(block 50). As in the embodiment of FIG. 1, the video
sequence is modified frame by frame until the last
frame of the sequence has been stored (blocks 52, 
54, and 56). 
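
The frame-by-frame loop of FIG. 2 may be outlined as follows. The sketch assumes that the temporal relationship has already been established, so that dubbee and dubber frames arrive paired one-to-one, and it takes the extraction, de-normalization, and texture mapping steps as caller-supplied placeholder functions; none of these names comes from the disclosure.

```python
def dub_sequence(dubbee_frames, dubber_frames, extract, denormalize, texture_map):
    """Modify each dubbee frame using the dubber's mouth in the paired frame.

    `extract`, `denormalize`, and `texture_map` are placeholders for the
    feature extraction, scaling, and texture mapping steps.
    """
    modified = []
    for dubbee_frame, dubber_frame in zip(dubbee_frames, dubber_frames):
        dubbee_features = extract(dubbee_frame)                  # blocks 40, 44
        dubber_features = extract(dubber_frame)                  # block 46
        target = denormalize(dubber_features, dubbee_features)   # block 48
        modified.append(texture_map(dubbee_frame, target))       # blocks 49, 50
    return modified                                              # blocks 52-56
```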

An illustrative audio-visual dubbing system 60
constructed in accordance with an illustrative embodiment
of the present invention is depicted in FIG.
3. As shown in FIG. 3, digitized video signals indicative
of original video frame sequences are sequentially
retrieved by frame retrieval module 61. A feature
extraction module 62 performs image feature extraction
on each retrieved frame in the manner discussed
above. Meanwhile, speech recognition module 64,
which may be a conventional speech recognition circuit,
analyzes the audio signal to identify the phonemic
or homophenic content. As indicated, appropriate
visemes and other facial information of the speaker 
occurring between transmitted frames can be reliably 
predicted from the phonemic content. It will be readily 
ascertained by those skilled in the art that to facilitate 
analysis of the dubber's speech, the audio signal may 
be previously recorded and synchronized with the vid- 
eo sequence. In the illustrative embodiment depicted 
in FIG. 3, an audio signal stored in this manner is retrieved
and output to speech recognition module
64 by audio signal retrieval module 63.

As discussed above, when a particular phoneme 
or homophene is detected in module 64, frame mod- 
ifying module 66 addresses feature position generat- 
ing module 68 to obtain facial position parameter data 
indicative of facial features such as mouth formations 
(visemes), eye, cheek, and jaw positions, and the like 
which correspond to features and feature positions of 
a person uttering the phoneme or homophene. As in- 
dicated above, the facial feature information need not 
be limited to speaker-independent facial feature position
parameters and may, in fact, include information
obtained by monitoring the phonemic content of the 
original audio signal representing the dubbee's 
speech. 

Frame modifying module 66, which may be con- 
figured to include a conventional video signal gener- 
ator, utilizes the original frame and the position para- 
meter information provided by module 68 to generate 
a modified frame. The position parameter data is first 
de-normalized by the frame modifying module to con- 
form dimensionally to those of the original frame. 
Modified frames are sequentially stored until an en- 
tire sequence has been generated. 

With reference now to FIG. 4, there is shown a 
video display system 80 constructed in accordance 
with a further modified embodiment of the present in- 
vention. In accordance with this additional embodi- 
ment, a viewer of a currently transmitted television 
broadcast or previously stored program may manually 
select between viewing the original broadcast or pro- 
gram along with a first synchronized audio signal rep- 
resenting the original speech or program or viewing a 
dubbed version in which a second audio track indicative of,
representative of, or incorporating utterances in a
language different than that of the dubbee is reproduced.
Once selection is made, the second audio
track is utilized to conform the mouth movements of
the dubbee to those of someone making the dubber's
utterance.

As shown in FIG. 4, the system 80 includes a first
receiver 82 for receiving a video signal defining a sequence
of frames depicting the dubbee, and a second
receiver 84 for receiving a plurality of audio signals
synchronized with the video signal. As will be readily
ascertained by those skilled in the art, receiver 84 is
adapted to receive a first audio signal that corresponds
to speech in the language spoken by the dubbee
as well as at least one other audio signal also synchronized
with the video signal and indicative of utterances
in another language supplied by the dubber.
Receiver 84 is coupled to sound reproducing means
86 and is adapted to provide one of the received audio
signals thereto. A manually operable selector switch
88 permits the viewer to hear the program in his native
language by controlling which audio signal track
will be supplied to and reproduced by reproducing
means 86.

If the viewer wishes to view a program as originally
broadcast or stored, that is, without dubbing,
switch 88 is positioned accordingly and the video signal
is processed in a conventional manner and displayed
on a suitable display means such as the picture
tube 90. Similarly, the first audio signal is output to
reproducing means 86, which may be configured as
one or more conventional audio speakers. If, on the
other hand, the viewer wishes to view the program
dubbed into another language, the position of switch
88 is changed and operation in accordance with the
inventive methods described above is initiated.
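
The two positions of selector switch 88 may be modelled, for illustration, as a two-valued mode that decides which audio track is reproduced and whether the frame is modified. The `Mode` enumeration, the track ordering, and the `modify_frame` callable are hypothetical names introduced for this sketch.

```python
from enum import Enum


class Mode(Enum):
    ORIGINAL = "original"
    DUBBED = "dubbed"


def select_outputs(mode, frame, audio_tracks, modify_frame):
    """Return the (frame, audio) pair to present for the chosen mode.

    `audio_tracks` is assumed to hold the original track at index 0 and
    the dubber's track at index 1; `modify_frame` is a placeholder for
    the frame adaptation described above.
    """
    if mode is Mode.ORIGINAL:
        return frame, audio_tracks[0]
    dubbed_audio = audio_tracks[1]
    return modify_frame(frame, dubbed_audio), dubbed_audio
```

In the original mode the frame passes through untouched, so the frame modification step is only exercised when the viewer asks for the dubbed version.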

If the original video signal is an analog signal, it
may be digitized by an A/D converter (not shown). In
the embodiment depicted in FIG. 4, it is assumed that
the original signal is received in digital form. Thus, as
shown, the input video signal is input directly to a buffer
92 which stores the incoming signal portions and
supplies them to frame modification means 94 in a
conventional manner. Similarly, the input audio signal
is input to an audio signal buffer 93. In accordance
with one of the techniques discussed above, the respective
frames are modified to simulate mouth formations
consistent with the utterances of the dubber
and these are output to picture tube 90 by frame modification
means 94 in a conventional manner.

It will, of course, be readily appreciated by those
skilled in the art that a wide variety of modifications
may be utilized to even further enhance the quality of
video interpolation accorded by the present invention.
It should therefore be understood that the invention is
not limited by or to the embodiments described above
which are presented as examples only but may be
modified in various ways within the scope of protection
defined by the appended patent claims.
Claims

1. A system for replacing the original sound track of
a video or film sequence formed of a plurality of 
frames depicting a dubbee, with an audio signal 
indicative of substituted utterances by a dubber, 
characterized by: 

frame modifying means, responsive to an 
audio signal associated with utterances of the 
dubber, for sequentially modifying frames of said 
sequence to conform therewith; and 

means for associating respective portions 
of said audio signal with facial feature informa- 
tion. 

2. The apparatus according to claim 1, further characterized
by monitoring means for detecting said
audio signal portions, each signal portion corre- 
sponding to a mouth formation associated with 
language spoken by the dubber. 

3. The apparatus according to claim 2, character- 
ized in that at least some of said signal portions 
comprise phonemes. 

4. The apparatus according to claim 2, character- 
ized in that at least some of said signal portions 
comprise homophenes.

5. The apparatus according to claim 2, character- 
ized in that each said mouth formation is a vi- 
seme. 

6. The apparatus according to claim 2, character- 
ized in that said mouth formations are mouth for- 
mations of the dubber. 



7. The apparatus according to claim 2, characterized
in that said associating means includes a
memory having stored therein a speaker-independent
table of respective mouth formation
parameter data for respective dubber utterances.

8. The apparatus according to claim 2, characterized
in that said associating means includes
means responsive to said monitoring means for
storing dubbee-dependent mouth position parameter
data indicative of respective mouth positions
as corresponding signal portions are detected
by said monitoring means.

9. The apparatus according to claim 2, characterized
in that said associating means includes
means responsive to said monitoring means for
storing dubber-dependent mouth position parameter
data indicative of respective mouth positions
as corresponding signal portions are detected
by said monitoring means.

10. The apparatus according to claim 1, characterized
in that said associating means is operative
to associate predetermined positions of at least
one of the jaw, tongue, and teeth of a speaker
with respective portions of the audio signal.

11. A method of replacing the original sound track of
a video or film sequence formed of a plurality of
frames and depicting a dubbee, with an audio signal
indicative of substituted utterances by a dubber,
characterized by the steps of:

associating frames of the sequence with
facial feature information associated with utterances
in a language spoken by the dubber; and

sequentially modifying the frames of the
sequence by conforming mouth formations of the
dubbee in accordance with said facial feature information.

12. The method of claim 11, characterized in that said
associating step includes

monitoring an audio signal portion indicative
of an utterance by the dubber and corresponding
to a frame of the sequence to be matched
with the utterance; and

identifying individual facial feature parameters
based on the audio signal portion.

13. The method of claim 12, characterized in that
said individual facial feature parameters are derived
by image feature extraction from video
frames indicative of a person speaking the utterance
of the dubber.

14. The method of claim 13, characterized in that
said person is the dubber.



15. The method of claim 11, further characterized by 
the step of storing sets of mouth formation data 
of the dubber during said utterances and wherein 
said modifying step includes identifying individ- 
ual facial features of the dubber corresponding to 
an utterance associated with a frame of the se- 
quence. 

16. A system for displaying video images indicative of 
a dubbee person speaking one of a plurality of 
languages, characterized by: 

first receiving means for receiving a video 
signal defining a sequence of frames depicting a 
dubbee; 

second receiving means for receiving a 
plurality of audio signals, including a first audio 
signal, synchronized with said video signal and 
corresponding to speech in the language spoken 
by the dubbee; 

frame modifying means, responsive to a
second audio signal synchronized with the video
signal and indicative of utterances in another language
supplied by the dubber, for sequentially
modifying frames of said sequence to conform
therewith; and

means for associating respective portions
of said audio signal with facial feature information.

17. The system of claim 16, characterized in that said
audio and video signals are transmitted in digital
form.

18. The system of claim 16, further characterized by 
buffer means for storing portions of said transmitted
video and audio signals and for repetitively
supplying portions of the video signal and the 
second audio signal corresponding to said frame 
modifying means. 

19. The system of claim 17, further characterized by

display means for displaying video images 
of the dubbee; 

speaker means for reproducing said audio 
signals; and 

selection means operatively associated 
with said frame modification means, said first re- 
ceiving means, and said second receiving means 
for operating said display means and speaker 
means in a first mode in which the frame se- 
quence received by said first receiving means is 
displayed and said first audio signal is repro- 
duced and in a second mode, in which a frame se- 
quence supplied by the frame modification 
means is displayed and said second audio signal 
is reproduced. 